Load packages
library(MASS) # ginv -- coefficient estimation
library(splines) # ns, bs -- spline curves
library(multcomp) # glht -- linear hypotheses
library(edgeR) # cpm, etc -- RNA-Seq normalization
library(limma) # lmFit, etc -- fitting many models
library(tidyverse) # working with data frames, plotting
Much of what we will be using is built into R without loading any packages.
Vectors and matrices
Vector operations
a <- c(3,4)
b <- c(5,6)
length(a)
## [1] 2
R performs operations elementwise:
a + b
## [1] 8 10
a * b
## [1] 15 24
We will be using the dot product a lot. This is:
sum(a*b)
## [1] 39
t(a) %*% b
## [,1]
## [1,] 39
The geometric length of a vector is (by Pythagorus):
sqrt(sum(a*a))
## [1] 5
Matrix operations
We can create a matrix with matrix, rbind (row bind), or cbind (column bind).
matrix(c(1,2,3,4), nrow=2, ncol=2)
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
rbind(c(1,3), c(2,4))
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
cbind(c(1,2), c(3,4))
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
X <- rbind(
c(1,0),
c(1,0),
c(1,1),
c(1,1))
X
## [,1] [,2]
## [1,] 1 0
## [2,] 1 0
## [3,] 1 1
## [4,] 1 1
class(X)
## [1] "matrix" "array"
The matrix transpose is obtained with t.
t(X)
## [,1] [,2] [,3] [,4]
## [1,] 1 1 1 1
## [2,] 0 0 1 1
Matrix multiplication is performed with %*%. The dot product of each row of the left hand side matrix and each column of the right hand side matrix is calculated. %*% treats a vector as either a single column or single row matrix as will make sense for matrix multiplication. Actually all we need today is to multiply a matrix by a vector, in which case we get the dot product of each row of the matrix with the vector.
X %*% a
## [,1]
## [1,] 3
## [2,] 3
## [3,] 7
## [4,] 7
as.vector(X %*% a)
## [1] 3 3 7 7
Challenge - use a dot product to calculate
The following dot product is an elaborate way to retrieve x[2]:
x <- c(10,20,30,40)
weights <- c(0,1,0,0) # <-- modify this line
sum(weights*x)
Modify weights in the above to calculate different quantities:
A. x[3]-x[2]
B. The mean of all four values.
Single numerical predictor
The age (year) and height (cm) of 10 people has been measured. We want a model that can predict height based on age.
people <- read_csv(
"age, height
10, 131
14, 147
16, 161
9, 136
16, 170
15, 160
15, 153
21, 187
9, 145
21, 195")
## Rows: 10 Columns: 2
## ── Column specification ───────────────────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): age, height
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ggplot(people, aes(x=age, y=height)) + geom_point()

fit <- lm(height ~ age, data=people)
fit
##
## Call:
## lm(formula = height ~ age, data = people)
##
## Coefficients:
## (Intercept) age
## 92.612 4.513
Coefficients are extracted with coef:
coef(fit)
## (Intercept) age
## 92.611502 4.512911
The residual standard deviation is extracted with sigma:
sigma(fit)
## [1] 7.263536
Behind the scenes a matrix of predictors has been produced from the mysterious notation ~ age. We can examine it explicitly:
model.matrix(fit)
model.matrix can be used without first calling lm.
model.matrix(~ age, data=people)
## (Intercept) age
## 1 1 10
## 2 1 14
## 3 1 16
## 4 1 9
## 5 1 16
## 6 1 15
## 7 1 15
## 8 1 21
## 9 1 9
## 10 1 21
## attr(,"assign")
## [1] 0 1
n=10 observations minus p=2 columns in the model matrix leaves 8 residual degrees of freedom:
df.residual(fit)
## [1] 8
Prediction
predict predicts. By default it produces predictions on the original dataset.
predict(fit)
## 1 2 3 4 5 6 7 8
## 137.7406 155.7923 164.8181 133.2277 164.8181 160.3052 160.3052 187.3826
## 9 10
## 133.2277 187.3826
predict(fit, interval="confidence")
## fit lwr upr
## 1 137.7406 129.8100 145.6712
## 2 155.7923 150.4399 161.1446
## 3 164.8181 159.2250 170.4111
## 4 133.2277 124.3009 142.1545
## 5 164.8181 159.2250 170.4111
## 6 160.3052 154.9836 165.6267
## 7 160.3052 154.9836 165.6267
## 8 187.3826 177.6105 197.1547
## 9 133.2277 124.3009 142.1545
## 10 187.3826 177.6105 197.1547
We can also calculate predictions manually.
# Prediction for a 15-year old
x <- c(1, 15)
beta <- coef(fit)
sum(x * beta)
## [1] 160.3052
# Prediction for all original data
X <- model.matrix(fit)
as.vector( X %*% beta )
## [1] 137.7406 155.7923 164.8181 133.2277 164.8181 160.3052 160.3052 187.3826
## [9] 133.2277 187.3826
predict can be used with new data.
new_people <- tibble(age=5:25)
predict(fit, new_people)
## 1 2 3 4 5 6 7 8
## 115.1761 119.6890 124.2019 128.7148 133.2277 137.7406 142.2535 146.7664
## 9 10 11 12 13 14 15 16
## 151.2793 155.7923 160.3052 164.8181 169.3310 173.8439 178.3568 182.8697
## 17 18 19 20 21
## 187.3826 191.8955 196.4085 200.9214 205.4343
new_predictions <- cbind(
new_people,
predict(fit, new_people, interval="confidence"))
ggplot() +
geom_ribbon(aes(x=age, ymin=lwr, ymax=upr), data=new_predictions, fill="grey") +
geom_line(aes(x=age, y=fit), data=new_predictions, color="blue") +
geom_point(aes(x=age, y=height), data=people) +
labs(y="height (cm)", x="age (year)",
subtitle="Ribbon shows 95% confidence interval of the model")

If you have ever used geom_smooth, it should now be a little less mysterious.
ggplot(people, aes(x=age, y=height)) + geom_smooth(method="lm") + geom_point()
## `geom_smooth()` using formula 'y ~ x'

Residuals
The residuals are the differences between predicted and actual values.
residuals(fit)
## 1 2 3 4 5 6 7
## -6.7406103 -8.7922535 -3.8180751 2.7723005 5.1819249 -0.3051643 -7.3051643
## 8 9 10
## -0.3826291 11.7723005 7.6173709
There should be no remaining relationship between predictions and the residuals (or between any individual predictors and the residual).
plot(predict(fit), residuals(fit))

A Q-Q (quantile-quantile) plot sorts the residuals and compares them to what would be expected from a normal distribution.
qqnorm(residuals(fit))
qqline(residuals(fit))

Ideally points would lie close to the line, but deviations are not a disaster. Our coefficient estimates will tend toward normally distributed errors even if the data does not, due to the Central Limit Theorem. Wild outliers should be investigated, as they may have a large effect on the model. We will see further examples of things to look for in a Q-Q plot in section 6.
plot(fit) produces a series of more sophisticated diagnostic plots.
plot(fit)
Single factor predictor, two levels
Consider a simple experiment where some outcome is measured for an untreated and a treated group. This can be viewed as a one-way ANalysis Of VAriance (ANOVA) experiment. (This is one of two senses in which the term ANOVA will be used today.)
outcomes <- read_csv(
"group, outcome
untreated, 4.98
untreated, 5.17
untreated, 5.66
untreated, 4.87
treated, 8.07
treated, 11.02
treated, 9.91")
## Rows: 7 Columns: 2
## ── Column specification ───────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): group
## dbl (1): outcome
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
outcomes$group <- factor(outcomes$group, c("untreated", "treated"))
outfit <- lm(outcome ~ group, data=outcomes)
outfit
##
## Call:
## lm(formula = outcome ~ group, data = outcomes)
##
## Coefficients:
## (Intercept) grouptreated
## 5.170 4.497
df.residual(outfit)
## [1] 5
sigma(outfit)
## [1] 0.9804353
model.matrix(outfit)
## (Intercept) grouptreated
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 1
## 6 1 1
## 7 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
How coefficients are estimated
Coefficients are estimated from responses by multiplying by the “Moore-Penrose generalized inverse” of X. It can be useful to examine this to work out exactly what a fit is doing. Each row shows how the corresponding coefficient is estimated.
X <- model.matrix(outfit)
y <- outcomes$outcome
round(ginv(X), 3)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.25 0.25 0.25 0.25 0.000 0.000 0.000
## [2,] -0.25 -0.25 -0.25 -0.25 0.333 0.333 0.333
ginv(X) %*% y
## [,1]
## [1,] 5.170000
## [2,] 4.496667
Here we can see the first coefficient is the average of the “untreated” samples, and the second is the average of the “treated” samples minus that average of the “untreated” samples.
( y contains noise, assumed to be identically normally distributed for each observation. Transformation of this noise by ginv(X) tells us the distribution of errors in the coefficients (see vcov()). This can be further propagated to give the distribution of errors in predictions, and in other linear combinations of coefficients. )
Challenge - the meanings of coefficients
We now consider the formula outcome ~ 0 + group.
Examine the model matrix that will be used:
model.matrix(~ 0 + group, data=outcomes)
What column has been removed because 0 + was used?
R has responded to this by being a bit clever when representing the factor. What column has been added?
The mean of the untreated group is 5.2, the mean of the treated group is 9.7, and the difference between them is 4.5. Without using lm, what values should the coefficients have to best fit the data?
Now perform the actual linear model fit:
outfit2 <- lm(outcome ~ 0 + group, data=outcomes)
- Using
sigma, does the new model fit the data better or worse than the original?
Testing a hypothesis
Besides data with categorical predictors, the term ANOVA is used to refer to the use of the F test. Significance is judged based on comparing the Residual Sums of Squares of two models. We fit a model representing a null hypothesis. This model formula must nest within our original model formula: any prediction it can make must also be possible to be made by the original model formula. We compare the models using the anova function.
outfit0 <- lm(outcome ~ 1, data=outcomes)
anova(outfit0, outfit)
## Analysis of Variance Table
##
## Model 1: outcome ~ 1
## Model 2: outcome ~ group
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 6 39.469
## 2 5 4.806 1 34.663 36.06 0.001839 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Warning: This is not the only way to use the anova( ) function, but I think it is the safest way. Once we start using multiple predictors, the meaning of the output from anova with a single model is likely to be not quite what you want, read the documentation carefully. drop1(fit, test="F") should usually be used instead.
summary( ) also outputs p-values. Too many p-values, summary( ) doesn’t respect the hierarchy of terms in the model. The p-value for dropping the intercept is nonsense. The p-values are based on a t statistic, where F=t^2.
summary(outfit)
##
## Call:
## lm(formula = outcome ~ group, data = outcomes)
##
## Residuals:
## 1 2 3 4 5 6 7
## -1.900e-01 1.096e-15 4.900e-01 -3.000e-01 -1.597e+00 1.353e+00 2.433e-01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.1700 0.4902 10.546 0.000132 ***
## grouptreated 4.4967 0.7488 6.005 0.001839 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9804 on 5 degrees of freedom
## Multiple R-squared: 0.8782, Adjusted R-squared: 0.8539
## F-statistic: 36.06 on 1 and 5 DF, p-value: 0.001839
confint( ) tells us not only that the difference between groups is non-zero but places a confidence interval on the difference. If the p-value were 0.05, the confidence interval would just touch zero. Whenever we reject a hypothesis that a single coefficient is zero, we may also conclude that we know its sign.
confint(outfit)
## 2.5 % 97.5 %
## (Intercept) 3.909855 6.430145
## grouptreated 2.571764 6.421569
These results exactly match those of a t-test.
t.test(outcomes$outcome[5:7], outcomes$outcome[1:4], var.equal=TRUE)
##
## Two Sample t-test
##
## data: outcomes$outcome[5:7] and outcomes$outcome[1:4]
## t = 6.005, df = 5, p-value = 0.001839
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.571764 6.421569
## sample estimates:
## mean of x mean of y
## 9.666667 5.170000
Challenge - does height change with age?
Return to the people dataset.
Can we reject the hypothesis that height is unrelated to age?
Compare the result to the outcome of a correlation test using cor.test( ).
What is the 95% confidence interval on the slope, in cm per year?
Multiple factors, many levels
Particle sizes of PVC plastic produced by a machine are measured. The machine is operated by three different people, and eight different batches of resin are used. Two measurements are made for each combination of these two experimental factors.
(This example is adapted from a data-set in the faraway package.)
pvc <- read_csv("r-linear-files/pvc.csv")
## Rows: 48 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): operator, resin
## dbl (1): psize
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pvc$operator <- factor(pvc$operator)
pvc$resin <- factor(pvc$resin)
ggplot(pvc, aes(x=resin, y=psize)) + geom_point() + facet_grid(~operator)

Main effects
pvcfit1 <- lm(psize ~ operator + resin, data=pvc)
summary(pvcfit1)
##
## Call:
## lm(formula = psize ~ operator + resin, data = pvc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9500 -0.6125 -0.0167 0.4063 3.6833
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.2396 0.5226 69.345 < 2e-16 ***
## operatorBob -0.2625 0.4048 -0.648 0.52059
## operatorCarl -1.5063 0.4048 -3.721 0.00064 ***
## resinR2 -1.0333 0.6610 -1.563 0.12630
## resinR3 -5.8000 0.6610 -8.774 1.13e-10 ***
## resinR4 -6.1833 0.6610 -9.354 2.11e-11 ***
## resinR5 -4.8000 0.6610 -7.261 1.09e-08 ***
## resinR6 -5.4500 0.6610 -8.245 5.46e-10 ***
## resinR7 -2.9167 0.6610 -4.412 8.16e-05 ***
## resinR8 -0.1833 0.6610 -0.277 0.78302
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.145 on 38 degrees of freedom
## Multiple R-squared: 0.8595, Adjusted R-squared: 0.8262
## F-statistic: 25.82 on 9 and 38 DF, p-value: 1.474e-13
confint(pvcfit1)
## 2.5 % 97.5 %
## (Intercept) 35.181635 37.2975319
## operatorBob -1.081983 0.5569834
## operatorCarl -2.325733 -0.6867666
## resinR2 -2.371544 0.3048775
## resinR3 -7.138211 -4.4617892
## resinR4 -7.521544 -4.8451225
## resinR5 -6.138211 -3.4617892
## resinR6 -6.788211 -4.1117892
## resinR7 -4.254877 -1.5784559
## resinR8 -1.521544 1.1548775
This model assumes the influence of the two factors is additive, the model only contains the “main effects”. The meanings of the coefficients are:
- “(Intercept)” is the particle size for Alice and R1
- “operatorBob” is the step in particle size going from Alice to Bob
- “operatorCarl” is the step in particle size going from Alice to Carl
- “resinR2” is the step in particle size going from R1 to R2
- “resinR3” is the step in particle size going from R1 to R3
- (etc)
We can use anova( ) to test if there is evidence either of these main effects is important. For example, to test if there is evidence that the operator is important to the outcome we can test pvcfit1 against a model in which operator is dropped:
pvcfit0 <- lm(psize ~ resin, data=pvc)
anova(pvcfit0, pvcfit1)
## Analysis of Variance Table
##
## Model 1: psize ~ resin
## Model 2: psize ~ operator + resin
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 40 70.533
## 2 38 49.815 2 20.718 7.902 0.00135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The drop1( ) function with test="F" can be used as a shorthand to test dropping each term that can be dropped from the model formula.
drop1(pvcfit1, test="F")
## Single term deletions
##
## Model:
## psize ~ operator + resin
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 49.82 21.782
## operator 2 20.718 70.53 34.474 7.902 0.00135 **
## resin 7 283.946 333.76 99.083 30.943 8.111e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interactions
We can ask if there is any interaction between the two factors. For example Carl might produce particularly small particles with R5. An additive model doesn’t allow for this.
pvcfit2 <- lm(psize ~ operator + resin + operator:resin, data=pvc)
# or
pvcfit2 <- lm(psize ~ operator*resin, data=pvc)
pvcfit2
##
## Call:
## lm(formula = psize ~ operator * resin, data = pvc)
##
## Coefficients:
## (Intercept) operatorBob operatorCarl
## 36.25 -0.85 -0.95
## resinR2 resinR3 resinR4
## -1.10 -5.55 -6.55
## resinR5 resinR6 resinR7
## -4.40 -6.05 -3.35
## resinR8 operatorBob:resinR2 operatorCarl:resinR2
## 0.55 1.05 -0.85
## operatorBob:resinR3 operatorCarl:resinR3 operatorBob:resinR4
## -0.20 -0.55 1.20
## operatorCarl:resinR4 operatorBob:resinR5 operatorCarl:resinR5
## -0.10 0.40 -1.60
## operatorBob:resinR6 operatorCarl:resinR6 operatorBob:resinR7
## 1.30 0.50 0.45
## operatorCarl:resinR7 operatorBob:resinR8 operatorCarl:resinR8
## 0.85 0.50 -2.70
This model allows for interactions between the two factors. There are enough predictors in the model matrix that each combination of levels can take on a distinct value. So we now have
- “(Intercept)” is particle size for Alice and R1
- “operatorBob” is the step in particle size from Alice to Bob, for R1
- “operatorCarl” is the step in particle size from Alice to Carl, for R1
- “resinR2” is the step in particle size from R1 to R2, for Alice
- (etc)
- “operatorBob:resinR2” is the particle size for Bob and R2, relative to
(Intercept)+operatorBob+resinR2.
- (etc)
anova(pvcfit1, pvcfit2)
## Analysis of Variance Table
##
## Model 1: psize ~ operator + resin
## Model 2: psize ~ operator * resin
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 38 49.815
## 2 24 35.480 14 14.335 0.6926 0.7599
Contrasts and confidence intervals
anova( ) lets us test if a particular factor or interaction is needed at all, and summary( ) allows us to see if any levels of a factor differ from the first level. However we may wish to perform different comparisons of the levels of a factor – this is called a “contrast”. We might also want to test some more complicated combination of coefficients such as a difference between two hypothetical individuals. In general this is called a “linear hypotheses” or a “general linear hypothesis”.
Say we want to compare Bob and Carl’s particle sizes. We will use the pvcfit1 model.
coef(pvcfit1)
## (Intercept) operatorBob operatorCarl resinR2 resinR3 resinR4
## 36.2395833 -0.2625000 -1.5062500 -1.0333333 -5.8000000 -6.1833333
## resinR5 resinR6 resinR7 resinR8
## -4.8000000 -5.4500000 -2.9166667 -0.1833333
K <- rbind(Carl_vs_Bob = c(0, -1,1, 0,0,0,0,0,0,0))
K %*% coef(pvcfit1)
## [,1]
## Carl_vs_Bob -1.24375
This is the estimated difference in particle size between Carl and Bob, but can we trust it? The glht function from multcomp can tell us. GLHT stands for General Linear Hypothesis Test.
library(multcomp)
result <- glht(pvcfit1, K)
result
##
## General Linear Hypotheses
##
## Linear Hypotheses:
## Estimate
## Carl_vs_Bob == 0 -1.244
summary(result)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Carl_vs_Bob == 0 -1.2437 0.4048 -3.072 0.00391 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(result)
##
## Simultaneous Confidence Intervals
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Quantile = 2.0244
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## Carl_vs_Bob == 0 -1.2437 -2.0632 -0.4243
glht can test multiple hypotheses at once. By default it applies a multiple testing correction when doing so. This is a generalization of Tukey’s Honestly Significant Differences.
K <- rbind(
Bob_vs_Alice = c(0, 1,0, 0,0,0,0,0,0,0),
Carl_vs_Alice = c(0, 0,1, 0,0,0,0,0,0,0),
Carl_vs_Bob = c(0, -1,1, 0,0,0,0,0,0,0))
result <- glht(pvcfit1, K)
summary(result)
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Bob_vs_Alice == 0 -0.2625 0.4048 -0.648 0.79437
## Carl_vs_Alice == 0 -1.5063 0.4048 -3.721 0.00189 **
## Carl_vs_Bob == 0 -1.2437 0.4048 -3.072 0.01067 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
confint(result)
##
## Simultaneous Confidence Intervals
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Quantile = 2.4376
## 95% family-wise confidence level
##
##
## Linear Hypotheses:
## Estimate lwr upr
## Bob_vs_Alice == 0 -0.2625 -1.2493 0.7243
## Carl_vs_Alice == 0 -1.5063 -2.4930 -0.5195
## Carl_vs_Bob == 0 -1.2437 -2.2305 -0.2570
plot(result)

We can also turn off the multiple testing correction.
summary(result, test=adjusted("none"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Bob_vs_Alice == 0 -0.2625 0.4048 -0.648 0.52059
## Carl_vs_Alice == 0 -1.5063 0.4048 -3.721 0.00064 ***
## Carl_vs_Bob == 0 -1.2437 0.4048 -3.072 0.00391 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- none method)
A reasonable compromise between these extremes is Benjamini and Hochberg’s False Discovery Rate (FDR) correction.
summary(result, test=adjusted("fdr"))
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = psize ~ operator + resin, data = pvc)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## Bob_vs_Alice == 0 -0.2625 0.4048 -0.648 0.52059
## Carl_vs_Alice == 0 -1.5063 0.4048 -3.721 0.00192 **
## Carl_vs_Bob == 0 -1.2437 0.4048 -3.072 0.00587 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- fdr method)
Finally, we can ask if any of the linear combinations is non-zero, i.e. whether the model with all three constraints applied can be rejected. This is equivalent to the anova( ) tests we have done earlier. (Note that while we have three constraints, the degrees of freedom reduction is 2, as any 2 of the constraints are sufficient. This makes me uneasy as it is reliant on numerical accuracy, better to just use any two of the constraints.)
summary(result, test=Ftest())
##
## General Linear Hypotheses
##
## Linear Hypotheses:
## Estimate
## Bob_vs_Alice == 0 -0.2625
## Carl_vs_Alice == 0 -1.5063
## Carl_vs_Bob == 0 -1.2437
##
## Global Test:
## F DF1 DF2 Pr(>F)
## 1 7.902 2 38 0.00135
pvcfit0 <- lm(psize ~ resin, data=pvc)
anova(pvcfit0, pvcfit1)
## Analysis of Variance Table
##
## Model 1: psize ~ resin
## Model 2: psize ~ operator + resin
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 40 70.533
## 2 38 49.815 2 20.718 7.902 0.00135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This demonstrates that the two methods of testing hypotheses–with the ANOVA test and with linear hypotheses–are equivalent.
Heteroscedasticity
ggplot(pvc, aes(x=resin, y=residuals(pvcfit1))) +
geom_point() + geom_hline(yintercept=0) + facet_grid(~operator)

Our assumption that the residual noise is uniformly normally distributed may not hold. Carl’s data seems to have greater standard deviation than Alice or Bob’s. When comparing Alice and Bob’s results, including Carl’s data in the model may alter the outcome.
Challenge - examine contrasts
Using the pvcfit1 model, construct linear hypotheses to see if the effect of:
- R8 is different to R4
- R2 is different to R1
An easier way to specify contrasts
So far we have been manually constructing linear hypotheses. The multcomp package automates this for some common situations. To compare all pairs of levels within a factor, use the mcp function, giving the factor to test as the name of an argument and specifying to test “Tukey” contrasts:
result <- glht(pvcfit1, mcp(resin="Tukey"))
To compare the first level of a factor to all other levels, specify to test “Dunnett” contrasts:
result <- glht(pvcfit1, mcp(resin="Dunnett"))
The linear hypotheses actually used can be inspected with result$linfct.
The emmeans package also automates many common comparisons. In particular, if you are working with models including interactions this package provides results for an “average” individual when examining main effects. By default it treats each level of other factors as being equally likely when calculating this average. It’s up to you to decide if this is sensible!
library(emmeans)
emmeans(pvcfit1, ~ resin)
emmeans(pvcfit1, pairwise ~ resin)
emmeans(pvcfit1, trt.vs.ctrl ~ resin)
confint( emmeans(pvcfit1, trt.vs.ctrl ~ resin)$contrasts )
Gene expression example
Tooth growth in mouse embryos is studied using RNA-Seq. The RNA expression levels of several genes are examined in the cells that form the upper and lower first molars, in eight individual mouse embryos that have been dissected after different times of embryo development. The measurements are in terms of “Reads Per Million”, essentially the fraction of RNA in each sample belonging to each gene, times 1 million.
(This data was extracted from ARCHS4 (https://amp.pharm.mssm.edu/archs4/). In the Gene Expression Omnibus it is entry GSE76316. The sample descriptions in GEO seem to be out of order, but reading the associated paper and the genes they talk about I think I have the correct order of samples!)
teeth <- read_csv("r-linear-files/teeth.csv")
## Rows: 16 Columns: 8
## ── Column specification ───────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): sample, mouse, tooth
## dbl (5): day, gene_ace, gene_smoc1, gene_pou3f3, gene_wnt2
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
teeth$tooth <- factor(teeth$tooth, c("lower","upper"))
teeth$mouse <- factor(teeth$mouse)
It will be convenient to have a quick way to examine different genes and different models with this data.
# A convenience to examine different model fits
more_data <- expand.grid(
day=seq(14.3,18.2,by=0.01),
tooth=as_factor(c("lower","upper")))
look <- function(y, fit=NULL) {
p <- ggplot(teeth,aes(x=day,group=tooth))
if (!is.null(fit)) {
more_ci <- cbind(
more_data,
predict(fit, more_data, interval="confidence"))
p <- p +
geom_ribbon(data=more_ci, aes(ymin=lwr,ymax=upr),alpha=0.1) +
geom_line(data=more_ci,aes(y=fit,color=tooth))
}
p + geom_point(aes(y=y,color=tooth)) +
labs(y=deparse(substitute(y)))
}
# Try it out
look(teeth$gene_ace)

We could treat day as a categorical variable, as in the previous section. However let us treat it as numerical, and see where that leads.
Transformation
Ace gene
acefit <- lm(gene_ace ~ tooth + day, data=teeth)
look(teeth$gene_ace, acefit)

Two problems:
- The actual data appears to be curved, our straight lines are not a good fit.
- The predictions fall below zero, a physical impossibility.
In this case, log transformation of the data will solve both these problems.
log2_acefit <- lm( log2(gene_ace) ~ tooth + day, data=teeth)
look(log2(teeth$gene_ace), log2_acefit)

Various transformations of y are possible. Log transformation is commonly used in the context of gene expression. Square root transformation can also be appropriate with nicely behaved count data (technically, if the errors follow a Poisson distribution). This gene expression data is ultimately count based, but is overdispersed compared to the Poisson distribution so square root transformation isn’t appropriate in this case. The Box-Cox transformations provide a spectrum of further options.
Pou3f3 gene
In the case of the Pou3f3 gene, the log transformation is even more important. It looks like gene expression changes at different rates in the upper and lower molars, that is there is a significant interaction between tooth and day.
pou3f3fit0 <- lm(gene_pou3f3 ~ tooth + day, data=teeth)
look(teeth$gene_pou3f3, pou3f3fit0)

pou3f3fit1 <- lm(gene_pou3f3 ~ tooth * day, data=teeth)
look(teeth$gene_pou3f3, pou3f3fit1)

anova(pou3f3fit0, pou3f3fit1)
## Analysis of Variance Table
##
## Model 1: gene_pou3f3 ~ tooth + day
## Model 2: gene_pou3f3 ~ tooth * day
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 13 4849.7
## 2 12 415.2 1 4434.5 128.15 9.234e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(pou3f3fit1)["toothupper:day",]
## 2.5 % 97.5 %
## -34.65685 -23.46937
The slopes of the lines confidently differ by at least 23.5 RPM per day.
Examining the residuals reveals a further problem: larger expression values are associated with larger residuals.
look(residuals(pou3f3fit1))

plot(predict(pou3f3fit1), residuals(pou3f3fit1))

qqnorm(residuals(pou3f3fit1))
qqline(residuals(pou3f3fit1))

Log transformation both removes the interaction and makes the residuals more uniform (except for one outlier).
log2_pou3f3fit0 <- lm(log2(gene_pou3f3) ~ tooth + day, data=teeth)
log2_pou3f3fit1 <- lm(log2(gene_pou3f3) ~ tooth * day, data=teeth)
anova(log2_pou3f3fit0, log2_pou3f3fit1)
## Analysis of Variance Table
##
## Model 1: log2(gene_pou3f3) ~ tooth + day
## Model 2: log2(gene_pou3f3) ~ tooth * day
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 13 0.56714
## 2 12 0.56579 1 0.0013576 0.0288 0.8681
confint(log2_pou3f3fit1)["toothupper:day",]
## 2.5 % 97.5 %
## -0.2225600 0.1903981
The difference of the slopes of log2 gene expression between lower and upper molars is confidently between -0.22 and 0.19.
look(log2(teeth$gene_pou3f3), log2_pou3f3fit0)

qqnorm(residuals(log2_pou3f3fit0))
qqline(residuals(log2_pou3f3fit0))

Curve fitting
Smoc1 gene
log2_smoc1fit <- lm(log2(gene_smoc1) ~ tooth + day, data=teeth)
look(log2(teeth$gene_smoc1), log2_smoc1fit)

In this case, log transformation does not remove the curve. If you think this is a problem for linear models, you are mistaken! With a little feature engineering we can fit a quadratic curve. Calculations can be included in the formula if wrapped in I( ):
curved_fit <- lm(log2(gene_smoc1) ~ tooth + day + I(day^2), data=teeth)
look(log2(teeth$gene_smoc1), curved_fit)

Another way to do this would be to add the column to the data frame:
teeth$day_squared <- teeth$day^2
curved_fit2 <- lm(log2(gene_smoc1) ~ tooth + day + day_squared, data=teeth)
Finally, the poly( ) function can be used in a formula to fit polynomials of arbitrary degree. poly will encode day slightly differently, but produces an equivalent fit.
curved_fit3 <- lm(log2(gene_smoc1) ~ tooth + poly(day,2), data=teeth)
sigma(curved_fit)
## [1] 0.1630425
sigma(curved_fit2)
## [1] 0.1630425
sigma(curved_fit3)
## [1] 0.1630425
poly( ) can also be used to fit higher order polynomials, but these tend to become wobbly and extrapolate poorly. A better option may be to use the ns( ) or bs( ) functions in the splines package, which can be used to fit piecewise “B-splines”. In particular ns( ) (natural spline) is appealing because it extrapolates beyond the ends only with straight lines. If the data is cyclic (for example cell cycle or circadian time series), sine and cosine terms can be used to fit some number of harmonics from a Fourier series.
library(splines)
spline_fit <- lm(log2(gene_smoc1) ~ tooth * ns(day,3), data=teeth)
look(log2(teeth$gene_smoc1), spline_fit)

Day is confounded with mouse
There may be individual differences between mice. We would like to take this into our account in a model. In general it is common to include batch effect terms in a model in order to correctly model the data (and increase the significance level of results), even if they are not directly of interest.
badfit <- lm(log2(gene_ace) ~ tooth + day + mouse, data=teeth)
summary(badfit)
##
## Call:
## lm(formula = log2(gene_ace) ~ tooth + day + mouse, data = teeth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20562 -0.02335 0.00000 0.02335 0.20562
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.115920 0.658430 -19.920 2.01e-07 ***
## toothupper -0.280467 0.070399 -3.984 0.0053 **
## day 0.992214 0.040228 24.665 4.59e-08 ***
## mouseind2 0.100073 0.131897 0.759 0.4728
## mouseind3 -0.198260 0.125612 -1.578 0.1585
## mouseind4 -0.122056 0.122349 -0.998 0.3517
## mouseind5 -0.203226 0.122349 -1.661 0.1407
## mouseind6 -0.009452 0.125612 -0.075 0.9421
## mouseind7 -0.042441 0.131897 -0.322 0.7570
## mouseind8 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1408 on 7 degrees of freedom
## Multiple R-squared: 0.9934, Adjusted R-squared: 0.9859
## F-statistic: 131.9 on 8 and 7 DF, p-value: 6.131e-07
In this case this is not possible, and R has arbirarily dropped a predictor from the model. As a different mouse produced data for each different day, mouse is confounded with day. day can be constructed as a linear combination of the intercept term and the mouse terms. The model suffers from multicollinearity.
Another example of confounding would be an experiment in which each treatment is done in a separate batch.
Even if predictors are not perfectly multicollinear, correlation between predictors can make their estimates inaccurate. One way to check for this is to attempt to predict each of the predictors with a linear model that uses the remaining predictors (see “Variance Inflation Factor”).
A possible solution to this problem would be to use a “mixed model”, but this is beyond the scope of today’s workshop.
Challenge - Wnt2 gene (20 minutes)
Look at the expression of gene Wnt2 in column gene_wnt2.
Try some different model formulas.
Justify a particular model by rejecting simpler alternatives using anova( ).
Testing many genes with limma
In this section we look at fitting the same matrix of predictors X to many different sets of responses y. We will use the package limma from Bioconductor. This is a very brief demonstration, and there is much more to this package. See the RNAseq123 tutorial at https://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html and the usersguide.pdf at https://bioconductor.org/packages/release/bioc/html/limma.html
library(edgeR)
library(limma)
Load the data
Actually in the teeth dataset, the expression level of all genes was measured!
counts_df <- read_csv("r-linear-files/teeth-read-counts.csv")
counts <- as.matrix( select(counts_df, -gene) )
rownames(counts) <- counts_df$gene
dim(counts)
## [1] 32544 16
counts[1:5,]
## ind1lower ind1upper ind2lower ind2upper ind3lower ind3upper
## 0610007P14Rik 1721 1846 1708 1877 1443 1534
## 0610009B22Rik 512 466 554 504 501 498
## 0610009L18Rik 68 71 121 129 74 124
## 0610009O20Rik 2583 2797 2840 2975 2675 2690
## 0610010F05Rik 1079 1246 1262 992 1232 992
## ind4lower ind4upper ind5lower ind5upper ind6lower ind6upper
## 0610007P14Rik 1389 1439 1699 1881 1697 1716
## 0610009B22Rik 478 427 569 614 577 568
## 0610009L18Rik 83 87 86 125 99 135
## 0610009O20Rik 2429 2313 2693 3093 2854 3066
## 0610010F05Rik 1011 977 1175 1384 1235 1221
## ind7lower ind7upper ind8lower ind8upper
## 0610007P14Rik 1531 1625 1716 1373
## 0610009B22Rik 587 531 611 453
## 0610009L18Rik 130 132 114 92
## 0610009O20Rik 2487 2904 2713 2455
## 0610010F05Rik 1042 998 1079 882
Model matrix
The column names match our teeth data frame.
colnames(counts)
## [1] "ind1lower" "ind1upper" "ind2lower" "ind2upper" "ind3lower" "ind3upper"
## [7] "ind4lower" "ind4upper" "ind5lower" "ind5upper" "ind6lower" "ind6upper"
## [13] "ind7lower" "ind7upper" "ind8lower" "ind8upper"
teeth$sample
## [1] "ind1lower" "ind1upper" "ind2lower" "ind2upper" "ind3lower" "ind3upper"
## [7] "ind4lower" "ind4upper" "ind5lower" "ind5upper" "ind6lower" "ind6upper"
## [13] "ind7lower" "ind7upper" "ind8lower" "ind8upper"
We will use limma to fit a linear model to each gene. The same model formula will be used in each case. limma doesn’t automatically convert a formula into a model matrix, so we have to do this step manually. Here I am using a model formula that treats the upper and lower teeth as following a different linear trend over time.
X <- model.matrix(~ tooth * day, data=teeth)
X
## (Intercept) toothupper day toothupper:day
## 1 1 0 14.5 0.0
## 2 1 1 14.5 14.5
## 3 1 0 15.0 0.0
## 4 1 1 15.0 15.0
## 5 1 0 15.5 0.0
## 6 1 1 15.5 15.5
## 7 1 0 16.0 0.0
## 8 1 1 16.0 16.0
## 9 1 0 16.5 0.0
## 10 1 1 16.5 16.5
## 11 1 0 17.0 0.0
## 12 1 1 17.0 17.0
## 13 1 0 17.5 0.0
## 14 1 1 17.5 17.5
## 15 1 0 18.0 0.0
## 16 1 1 18.0 18.0
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$tooth
## [1] "contr.treatment"
limma refers to this matrix as the “design”.
Remove low expression genes
There is little chance of detecting differential expression in genes with very low read counts. Including these genes will require a larger False Discovery Rate correction, and also confuses limma’s Empirical Bayes parameter estimation. edgeR provides a function filterByExpr that performs their recommended level of filtering. The filtering takes into account the complexity of the model being fitted, so the “design” matrix is also needed.
keep <- filterByExpr(counts, design=X)
table(keep)
## keep
## FALSE TRUE
## 15333 17211
counts_kept <- counts[keep, ]
dim(counts_kept)
## [1] 17211 16
Fitting a model to and testing each gene
We are now ready to fit a linear model to each gene.
fit <- lmFit(log2_cpms, X)
class(fit)
## [1] "MArrayLM"
## attr(,"package")
## [1] "limma"
fit$coefficients[1:5,]
## (Intercept) toothupper day toothupper:day
## 0610007P14Rik 5.807656 1.3158722 -0.03179865 -0.08061826
## 0610009B22Rik 3.136273 0.5999070 0.03696200 -0.04605341
## 0610009L18Rik -0.902155 1.4890594 0.13042420 -0.08089679
## 0610009O20Rik 6.611072 0.2355852 -0.03671146 -0.01261433
## 0610010F05Rik 5.819858 0.2543831 -0.06338421 -0.02250952
Significance testing in limma is by the use of linear hypotheses (which limma refers to as “contrasts”). A difference between glht and limma’s contrasts.fit is that limma expects the linear combinations to be in columns rather than rows.
We will first look for genes where the slope over time is not flat, averaging the lower and upper teeth.
# Lower slope: c(0,0,1,0)
# Upper slope: c(0,0,1,1)
K <- rbind( average_slope=c(0,0,1,0.5) )
cfit <- contrasts.fit(fit, t(K)) #linear combinations in columns!
efit <- eBayes(cfit, trend=TRUE)
The call to eBayes does Empirical Bayes squeezing of the residual variance for each gene (see Appendix B). This is a bit of magic that allows limma to work well with small numbers of samples.
topTable(efit)
## logFC AveExpr t P.Value adj.P.Val B
## Rnf144b 0.6705081 4.483834 33.64061 8.221793e-16 1.415053e-11 25.97540
## Tfap2b -0.6416163 7.468730 -28.77129 8.766956e-15 3.581088e-11 23.92465
## Six2 -0.5576619 6.978666 -28.60904 9.548454e-15 3.581088e-11 23.84856
## Vldlr 0.4147078 6.814521 28.51996 1.000874e-14 3.581088e-11 23.80655
## Prom2 0.9165496 3.676863 28.27714 1.138751e-14 3.581088e-11 23.69118
## Ace 0.9858071 2.813200 28.06009 1.279153e-14 3.581088e-11 23.58699
## Stk32a 1.2259001 3.708815 27.81963 1.456488e-14 3.581088e-11 23.47034
## Bambi 0.5612789 6.104506 26.74115 2.643012e-14 5.686111e-11 22.93116
## Msx2 0.8126541 6.253647 25.67587 4.872568e-14 9.317974e-11 22.37141
## Cox4i2 0.5686825 3.809406 24.67488 8.854784e-14 1.523997e-10 21.81903
The column adj.P.Val contains FDR adjusted p-values.
all_results <- topTable(efit, n=Inf)
significant <- all_results$adj.P.Val <= 0.05
table(significant)
## significant
## FALSE TRUE
## 10423 6788
ggplot(all_results, aes(x=AveExpr, y=logFC)) +
geom_point(size=0.1) +
geom_point(data=all_results[significant,], size=0.5, color="red")

Relation to lm( ) and glht( )
Let’s look at a specific gene.
rnf144b <- log2_cpms["Rnf144b",]
rnf144b_fit <- lm(rnf144b ~ tooth * day, data=teeth)
look(rnf144b, rnf144b_fit)

We can use the same linear hypothesis with glht. The estimate is the same, but limma has gained some power by shrinking the variance toward the trend line, so limma’s p-value is smaller.
summary( glht(rnf144b_fit, K) )
##
## Simultaneous Tests for General Linear Hypotheses
##
## Fit: lm(formula = rnf144b ~ tooth * day, data = teeth)
##
## Linear Hypotheses:
## Estimate Std. Error t value Pr(>|t|)
## average_slope == 0 0.67051 0.01875 35.77 1.46e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)
Confidence intervals
Confidence Intervals should also be of interest. However note that these are not adjusted for multiple testing (see Appendix B).
topTable(efit, confint=0.95)
## logFC CI.L CI.R AveExpr t P.Value
## Rnf144b 0.6705081 0.6281126 0.7129035 4.483834 33.64061 8.221793e-16
## Tfap2b -0.6416163 -0.6890509 -0.5941817 7.468730 -28.77129 8.766956e-15
## Six2 -0.5576619 -0.5991235 -0.5162002 6.978666 -28.60904 9.548454e-15
## Vldlr 0.4147078 0.3837783 0.4456372 6.814521 28.51996 1.000874e-14
## Prom2 0.9165496 0.8476051 0.9854941 3.676863 28.27714 1.138751e-14
## Ace 0.9858071 0.9110794 1.0605348 2.813200 28.06009 1.279153e-14
## Stk32a 1.2259001 1.1321692 1.3196310 3.708815 27.81963 1.456488e-14
## Bambi 0.5612789 0.5166334 0.6059244 6.104506 26.74115 2.643012e-14
## Msx2 0.8126541 0.7453317 0.8799765 6.253647 25.67587 4.872568e-14
## Cox4i2 0.5686825 0.5196602 0.6177048 3.809406 24.67488 8.854784e-14
## adj.P.Val B
## Rnf144b 1.415053e-11 25.97540
## Tfap2b 3.581088e-11 23.92465
## Six2 3.581088e-11 23.84856
## Vldlr 3.581088e-11 23.80655
## Prom2 3.581088e-11 23.69118
## Ace 3.581088e-11 23.58699
## Stk32a 3.581088e-11 23.47034
## Bambi 5.686111e-11 22.93116
## Msx2 9.317974e-11 22.37141
## Cox4i2 1.523997e-10 21.81903
F test
limma can also test several simultaneous constraints on linear combinations of coefficients. Suppose we want to find any deviation from a constant expression level. We can check for this with:
K2 <- rbind(
c(0,1,0,0),
c(0,0,1,0),
c(0,0,0,1))
cfit2 <- contrasts.fit(fit, t(K2))
efit2 <- eBayes(cfit2, trend=TRUE)
topTable(efit2)
## Coef1 Coef2 Coef3 AveExpr F P.Value
## Pou3f3 3.9677657 -0.3367111 -0.01422066 5.1309354 508.0331 1.388933e-15
## Rnf144b -1.4817168 0.6363724 0.06827130 4.4838343 400.3615 8.469656e-15
## Six2 -0.8370468 -0.6082833 0.10124287 6.9786656 384.2097 1.157124e-14
## Tfap2b -4.3550546 -0.7909806 0.29872853 7.4687297 322.7085 4.330436e-14
## Prom2 0.5804985 0.9612842 -0.08946912 3.6768631 313.2648 5.419522e-14
## Vldlr 2.1955450 0.4751431 -0.12087066 6.8145206 292.9660 8.985449e-14
## Nkx2-3 -12.4481924 -0.4507673 0.20535681 0.4715549 280.1517 1.258886e-13
## Ace -0.5231634 0.9784323 0.01474955 2.8132003 266.6050 1.828639e-13
## Stk32a 0.8473950 1.2638212 -0.07584210 3.7088148 263.0716 2.021843e-13
## Bambi 1.5414027 0.6171353 -0.11171292 6.1045062 251.5382 2.832907e-13
## adj.P.Val
## Pou3f3 2.390492e-11
## Rnf144b 6.638423e-11
## Six2 6.638423e-11
## Tfap2b 1.863278e-10
## Prom2 1.865508e-10
## Vldlr 2.577476e-10
## Nkx2-3 3.095240e-10
## Ace 3.866438e-10
## Stk32a 3.866438e-10
## Bambi 4.875716e-10
A shortcut here would be to skip the contrasts.fit step and specify a set of coefficients directly to topTable( ).
Further exercises
Construct and use linear combinations to find genes that:
A. Differ in slope between lower and upper molars.
B. Differ in expression on day 16 between the lower and upper molars. (Hint: this can be viewed as the difference in predictions between two individual samples.)
C. Construct a pair of linear combinations that when used together in an F test find genes with non-zero slope in either or both the lower and upper molars.
Construct a model matrix in which the change over time is quadratic, and some linear combinations to interrogate it.
Appendix A - mixed models
In this workshop we have looked at linear models with fixed effects only. A linear mixed model contains both fixed and random effects. Random effects are appropriate for individuals drawn from a population. The model we obtain will include how much variation there is in this population.
Mixed models would be a whole further workshop, but I will attempt a rough outline here.
In R, mixed models are most commonly fitted using the lmer function in lme4. A point of confusion you will encounter is that there are a variety of ways to perform significance tests. The authors of lme4 duck out of this debate entirely and only provide t statistics. However packages such as lmerTest, emmeans and pbkrtest can provide p-values and CIs from an lmer model.
My recommendation of a safe choice, from my reading, is to fit a model using the REML method (the default for lmer) and then test it using the Kenward-Roger method. The Satterthwaite method is also decent, usually producing nearly identical results with less computation. Other methods may produce p-values that are too small with small data sets. This includes the multcomp package that we used earlier, unfortunately. For more information see:
Luke, S.G. (2017). Evaluating significance in linear mixed-effects models in R. Behav Res 49, 1494–1502 https://doi.org/10.3758/s13428-016-0809-y
The differences between methods disappear when your data is large enough, but “large enough” might mean not just many rows in your data frame but also many levels for your random effects.
I will briefly demonstrate lmer with the PVC particle dataset, now treating resin batches as a random effect.
library(lmerTest) # lmerTest builds on lme4, can produce p-values
library(emmeans) # emmeans can provide Kenward-Roger CIs
pvc <- read_csv("r-linear-files/pvc.csv")
pvc$operator <- factor(pvc$operator)
pvc$resin <- factor(pvc$resin)
pvcfit_mixed <- lmer(psize ~ operator + (1|resin), data=pvc, REML=TRUE)
pvcfit_mixed
## Linear mixed model fit by REML ['lmerModLmerTest']
## Formula: psize ~ operator + (1 | resin)
## Data: pvc
## REML criterion at convergence: 172.2304
## Random effects:
## Groups Name Std.Dev.
## resin (Intercept) 2.558
## Residual 1.145
## Number of obs: 48, groups: resin, 8
## Fixed Effects:
## (Intercept) operatorBob operatorCarl
## 32.9437 -0.2625 -1.5062
We use new ( | ) notation to specify random effects. Right of the | you give the factor specifying individuals from a population. Left of the | you give predictors whose coefficients may vary between individuals. In the example above we are saying that the intercept may vary between individuals.
The mean about which individuals vary must also be included as a fixed effect in the model! Here an intercept term is implicitly included in the model as usual. However if your random effect specified a slope or some-such it would also need to be given as a fixed effect. Otherwise you would be saying the slope varies around zero.
A standard deviation has been estimated for the effect of different batches of resin on particle size, as well as the usual residual standard deviation. This is an extra source of random variation around the prediction from the fixed effect part of the model.
drop1 from lmerTest can be used to test whether fixed effects are statistically significant.
drop1(pvcfit_mixed, ddf="Kenward-Roger")
## Single term deletions using Kenward-Roger's method:
##
## Model:
## psize ~ operator + (1 | resin)
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## operator 20.718 10.359 2 38 7.902 0.00135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ranova from lmerTest can be used to test whether random effects are statistically significant.
ranova(pvcfit_mixed)
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## psize ~ operator + (1 | resin)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 5 -86.115 182.23
## (1 | resin) 4 -113.096 234.19 53.961 1 2.045e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
emmeans can be used on the fixed effects of the model.
result <- emmeans(pvcfit_mixed, pairwise ~ operator, lmer.df="Kenward-Roger")
result$emmeans
## operator emmean SE df lower.CL upper.CL
## Alice 32.9 0.949 7.93 30.8 35.1
## Bob 32.7 0.949 7.93 30.5 34.9
## Carl 31.4 0.949 7.93 29.2 33.6
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
confint(result$contrasts)
## contrast estimate SE df lower.CL upper.CL
## Alice - Bob 0.263 0.405 38 -0.725 1.25
## Alice - Carl 1.506 0.405 38 0.519 2.49
## Bob - Carl 1.244 0.405 38 0.257 2.23
##
## Degrees-of-freedom method: kenward-roger
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
For comparison, here are similar results from the fixed effects model:
pvcfit_fixed <- lm(psize ~ operator + resin, data=pvc)
drop1(pvcfit_fixed, test="F")
## Single term deletions
##
## Model:
## psize ~ operator + resin
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 49.82 21.782
## operator 2 20.718 70.53 34.474 7.902 0.00135 **
## resin 7 283.946 333.76 99.083 30.943 8.111e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
result_fixed <- emmeans(pvcfit_fixed, pairwise ~ operator)
result_fixed$emmeans
## operator emmean SE df lower.CL upper.CL
## Alice 32.9 0.286 38 32.4 33.5
## Bob 32.7 0.286 38 32.1 33.3
## Carl 31.4 0.286 38 30.9 32.0
##
## Results are averaged over the levels of: resin
## Confidence level used: 0.95
confint(result_fixed$contrasts)
## contrast estimate SE df lower.CL upper.CL
## Alice - Bob 0.263 0.405 38 -0.725 1.25
## Alice - Carl 1.506 0.405 38 0.519 2.49
## Bob - Carl 1.244 0.405 38 0.257 2.23
##
## Results are averaged over the levels of: resin
## Confidence level used: 0.95
## Conf-level adjustment: tukey method for comparing a family of 3 estimates
The simpler fixed effects model can be a useful check that you are fitting the mixed effects model you intend. The results will usually be similar, and any differences may be instructive. A fixed effect model might be chosen either by ignoring random effects or including them as fixed effects. In this example we see:
The pairwise comparisons are identical.
CIs for the means for each operator are wider for the mixed model. This is because they are an estimate of the mean for the whole population of resins. For the fixed effects model, we get the CI for the mean of exactly the eight resins present in the data.
Appendix B - more on limma
Empirical Bayes variance squeezing
In limma, Empirical Bayes squeezing of the residual variance acts as though we have some number of extra “prior” observations of the variance. These are also counted as extra degrees of freedom in F tests. The “prior” observations act to squeeze the estimated residual variance toward a trend line that is a function of the average expression level.
efit <- eBayes(cfit, trend=TRUE)
efit$df.prior
## [1] 3.36434
efit$df.residual
## [1] 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12
## [ reached getOption("max.print") -- omitted 17190 entries ]
efit$df.total
## [1] 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434
## [9] 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434 15.36434
## [17] 15.36434 15.36434 15.36434 15.36434 15.36434
## [ reached getOption("max.print") -- omitted 17190 entries ]
plotSA(efit)
points(efit$Amean, efit$s2.post^0.25, col="red", cex=0.2)

The total effective degrees of freedom is the “prior” degrees of freedom plus the normal residual degrees of freedom. As can be seen in the plot, compared to the residual variance (black dots), this produces a posterior residual variance (efit$s2.post, red dots) that is squeezed toward the trend line.
It’s worthwhile checking df.prior when using limma, as a low value may indicate a problem with a data-set.
Top confident effect sizes
A little self-promotion: With limma we are able to find genes where our effect of interest is significantly different from zero. However we may make many thousands of discoveries, too many to easily follow up, and some of the effects discovered may be tiny. I have developed a method to rank genes by confident effect size, with multiple testing correction, implemented in the package topconfects. This method builds on the treat method provided by limma.
This is performed with the limma_confects function. limma_confects incorporates the Empirical Bayes variance squeezing step, so remember to specify trend=TRUE if using trend based variance squeezing.
library(topconfects)
## Warning: package 'topconfects' was built under R version 4.1.1
result <- limma_confects(cfit, "average_slope", trend=TRUE, fdr=0.05)
result
## $table
## confect effect AveExpr name
## 1 -1.411 -2.655 1.555 Hbb-y
## 2 1.356 2.011 4.139 Mmp13
## 3 1.318 2.302 -1.784 Scgn
## 4 1.290 1.796 2.480 Ecel1
## 5 1.248 1.843 -1.503 Sdr16c6
## 6 1.233 2.095 1.197 Tbx22
## 7 -1.201 -1.701 0.688 Hba-x
## 8 1.171 2.015 3.501 Igfbpl1
## 9 1.075 1.706 5.724 Vwde
## 10 1.075 1.694 3.209 Adgrf4
## ...
## 6788 of 17211 non-zero log2 fold change at FDR 0.05
## Prior df 3.4
Results are ranked by “confect”, confident effect size. If you select genes with abs(confect) >= some value, those genes will have a magnitude of effect greater than that threshold, with controlled FDR. (Exactly the same set of genes is found using treat with this threshold.)
Compared to the “most significant” gene, the top gene by topconfects has a larger slope (but lower overall average expression).
hbby <- log2_cpms["Hbb-y",]
hbby_fit <- lm(hbby ~ tooth * day, data=teeth)
look(hbby, hbby_fit)

# Highlight genes with slope confidently larger than 0.2 per day.
ggplot(result$table, aes(x=AveExpr, y=effect)) +
geom_point(size=0.1) +
geom_point(
data=filter(result$table, abs(confect) >= 0.2),
size=0.5, color="red")

See also “False Coverage-statement Rate” for a generally applicable approach to multiple-testing corrected confidence intervals.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] splines stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] topconfects_1.10.0 emmeans_1.7.4-1 lmerTest_3.1-3 lme4_1.1-29
## [5] Matrix_1.4-1 forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9
## [9] purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7
## [13] ggplot2_3.3.6 tidyverse_1.3.1 edgeR_3.36.0 limma_3.50.3
## [17] multcomp_1.4-19 TH.data_1.1-1 survival_3.3-1 mvtnorm_1.1-3
## [21] MASS_7.3-57
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-157 pbkrtest_0.5.1 fs_1.5.2
## [4] lubridate_1.8.0 bit64_4.0.5 httr_1.4.3
## [7] numDeriv_2016.8-1.1 tools_4.1.0 backports_1.4.1
## [10] bslib_0.3.1 utf8_1.2.2 R6_2.5.1
## [13] DBI_1.1.2 mgcv_1.8-40 colorspace_2.0-3
## [16] withr_2.5.0 tidyselect_1.1.2 bit_4.0.4
## [19] compiler_4.1.0 cli_3.3.0 rvest_1.0.2
## [22] xml2_1.3.3 sandwich_3.0-1 labeling_0.4.2
## [25] sass_0.4.1 scales_1.2.0 digest_0.6.29
## [28] minqa_1.2.4 rmarkdown_2.14 pkgconfig_2.0.3
## [31] htmltools_0.5.2 dbplyr_2.1.1 fastmap_1.1.0
## [34] highr_0.9 rlang_1.0.2 readxl_1.4.0
## [37] rstudioapi_0.13 jquerylib_0.1.4 generics_0.1.2
## [40] farver_2.1.0 zoo_1.8-10 jsonlite_1.8.0
## [43] vroom_1.5.7 magrittr_2.0.3 Rcpp_1.0.8.3
## [46] munsell_0.5.0 fansi_1.0.3 lifecycle_1.0.1
## [49] stringi_1.7.6 yaml_2.3.5 grid_4.1.0
## [52] parallel_4.1.0 crayon_1.5.1 lattice_0.20-45
## [55] haven_2.5.0 hms_1.1.1 locfit_1.5-9.5
## [58] knitr_1.39 pillar_1.7.0 boot_1.3-28
## [61] estimability_1.3 codetools_0.2-18 reprex_2.0.1
## [64] glue_1.6.2 evaluate_0.15 modelr_0.1.8
## [67] vctrs_0.4.1 nloptr_2.0.2 tzdb_0.3.0
## [70] cellranger_1.1.0 gtable_0.3.0 assertthat_0.2.1
## [73] xfun_0.31 xtable_1.8-4 broom_0.8.0
## [76] coda_0.19-4 ellipsis_0.3.2
Home